A CONFIGURABLE MICROPROCESSOR ARCHITECTURE INCORPORATING
DIRECT EXECUTION UNIT CONNECTIVITY

TECHNICAL FIELD 

The present invention is in the field of digital computing systems. In particular, it relates to the
internal architecture of a configurable microprocessor system.

BACKGROUND ART 

Much of modern microprocessor design is focused on achieving higher levels of parallelism in
instruction execution. This increases the throughput of the processor at a given clock
frequency. Moreover, in the context of embedded systems where power consumption is often
a significant consideration, it allows the same level of performance at a lower clock frequency
and thus saves power. A key problem in achieving high levels of parallelism is the design of a
centralized register file.

As the level of parallelism in the instruction stream increases so does the number of access
ports required to a centralized register file. They are required to provide operands to and write
back results from all the active functional units. The complexity of the register file grows at
approximately [...], where N is the number of access ports. The register file soon becomes the
bottleneck in the design and starts to have a strong detrimental effect on the maximum clock
speed.

This scalability issue is further hampered by the need to provide an extensive network of
feed-forward buses between the various access ports. Register read and write operations are
typically performed in different stages of the execution pipeline. However, in order to achieve
high code performance it is a requirement that an instruction can pass its results onto an
immediately following instruction. Such an instruction is executed just one clock cycle later
(presuming the instruction only takes one clock cycle to perform). This requires that the
register file can detect reads and writes being performed on the same clock cycle to the same
register and provide special forwarding buses to directly transfer the data to the reading unit
without having to write to the register file first. Given the requirement that every write port
has to be compared against every read port, this creates a very challenging circuit design.
Moreover, it is within the critical path of the processor pipeline and has a direct impact on the
maximum clock frequency for the whole processor.

Some Very Long Instruction Word (VLIW) architectures have adopted a clustered approach
to help alleviate this issue. In this model the functional units are partitioned into clusters, each
having a private register file. Communication between clusters requires one additional clock
cycle of latency. Thus performance suffers if there is significant communication between
clusters. Code generation for such machines seeks to minimise the number of data transfers
between clusters.

Another approach is that undertaken within the field of Transport Triggered Architectures
(TTA). Code for TTAs controls transports rather than operations. That is, the instruction set
specifies how data items are moved around the machine to different functional units. It is
transport rather than operation centric in nature. By explicitly managing the transport of data
between functional units and the register file, a TTA is able to reduce the total number of
access ports required to the register file. Moreover, a TTA explicitly schedules the transport of
data over feed-forward buses and thus avoids the need for complex register number
comparison logic.

SUMMARY OF INVENTION

The disclosure describes a processor microarchitecture targeted at use in embedded systems
where there is significant repetition of the code sequences that are executed by the processor.
The microarchitecture is designed to be highly configurable in order to support an automated
processor generation method. Such a method analyses application software and automatically
architects a processor architecture with functional unit and connectivity resources that reflect
the requirements of the code sequences within the application software. The disclosure
provides a highly configurable and scalable microarchitecture to support such a design
trajectory.

A two-tier register file structure is used. There is a main register file but it has a very limited
number of access ports. The code generator seeks to minimise the number of register file
accesses by passing data values directly between functional units and intermediate holding
registers without passing them through the register file. Moreover, reads and writes to the
register file are explicitly generated by the code generator like any other operation. The register
file is treated like any other functional unit in the processor and has no special status.

Each functional unit has output registers for holding its results. Operands for functional units
are obtained via multiplexers that select results from a number of different result registers. The
execution words include the selection settings for these multiplexers on each clock cycle. Thus
rather than specifying a register number from where an operand is to be read or written, the
execution word specifies the bus on which a particular data item is available. The code
generator is aware of the structure of the buses in the processor and controls them alongside
the functional units themselves. The majority of data values are passed from functional unit to
functional unit without even passing through the register file.

If every functional unit could read from any result register then the problems of the
centralized register file would return, due to the level of connectivity to the multiplexers.
Connectivity in the architecture is generally minimised and focused on the connections that
provide the most impact on overall performance. Thus certain functional units that are not
directly connected may have to communicate data. To support this, certain functional units are
able to copy data from their input operands to their outputs. That way data can be transported
around the functional units as required using copies through functional units.

The microarchitecture also includes a branch mechanism that allows the actual execution of a
branch to be decoupled from the point of branch issue, using relatively simple hardware
mechanisms. It allows the microarchitecture to choose from one of a number of issued
branches to actually execute. This can be used to reduce the number of branches performed
and the disruption caused to the execution pipeline by the execution of such branches.

BRIEF DESCRIPTION OF DRAWINGS

Figure 1 illustrates the copying of data through a functional unit.

Figure 2 shows the general architecture of a functional unit and its connectivity to other
blocks within the architecture.

Figure 3 shows an example logical layout of functional units.

Figure 4 illustrates how the execution word is used to control the state of operand
multiplexers in the architecture in order to control data flow in the system.

Figure 5 shows the decomposition of the Next Region Address into its constituent
components.

Figure 6 provides an illustration of how execution words are formed into regions that have
particular control flow relationships.

Figure 7 shows an overview of the internal architecture of the branch control unit.

Figure 8 illustrates how the execution word can be broken into a number of different groups,
each of which can be used for control of a particular functional unit.

Figure 9 provides an overview of the components within a functional unit.

Figure 10 also provides an overview of the components within a functional unit and also
provides information about the data and control connectivity between the components.

Figure 11 provides an overview of the internal architecture of a conditional functional unit
controller.

Figure 12 provides an overview of the internal architecture of an unconditional functional unit
controller.

Figure 13 provides an overview of the internal architecture of an operand selector. 
Figure 14 provides an overview of the internal architecture of the delay pipeline.

Figure 15 provides an overview of the internal architecture of the output bank unit.

Figure 16 illustrates the data flow between various pipeline stages due to interactions
between functional units of differing latencies.

Figure 17 illustrates a timeline showing data flow between functional units in the datapath of an
example processor.

Figure 18 illustrates a timeline of the events that occur at the end of a region execution that
allow execution of a new region to be initiated.

Figure 19 provides a state transition diagram of the states within the branch unit.

DESCRIPTION OF PRESENTLY PREFERRED EMBODIMENT

This disclosure describes the underlying microarchitecture of the preferred embodiment. It
shows how instructions are fetched, decoded and directed towards the appropriate execution
unit. It also shows how the branch control mechanisms are implemented.

The philosophy of the microarchitecture is significantly different from contemporary RISC
and VLIW architectures. These architectures tend to be very operation centric in their nature.
The instruction set consists of several different operations that are executed on one of a
number of execution units. Each of these instructions reads operands from the central register
file and writes all results back to the same central register file. The instruction format consists
of the specification of the operation and the register file location of the operands and result.
The programmer does not specify the buses that are used to transport data to and from the
execution units. Indeed, these buses are architecturally invisible at the instruction level. In a
highly pipelined architecture the bus structures are actually very complex as multiple bypass
paths also have to be present to allow the register file to be pipelined. The register file is a
central bottleneck of the architecture that needs to be connected via buses to all execution
units in the system. To support multiple parallel operations it also needs to support many
simultaneous read and write access ports.

As feature sizes of modern VLSI technology are reduced, the distance that can be spanned
over the chip in a single cycle is rapidly reducing. Wire propagation delays are starting to
dominate over gate delays. The buses that connect systems together within a processor are
starting to become much more important to the overall performance of the system. This is
not sufficiently reflected in the architectural design of processors.

The preferred embodiment is a highly communication orientated architecture. It is the
position of the bits in the execution word that specifies which operation should be performed.
The bits themselves explicitly specify which buses should be used to transport operand data
into an execution unit. All data buses in the architecture are under explicit software control.
There are no hidden, bypass or feed-forward buses.

Although the architecture does have a central register file it is treated like any other
functional unit. All accesses to the register file have to be explicitly scheduled as separate
operations. Since the register file acts like any other functional unit its bandwidth is limited.
The code is constructed so that the majority of data values are communicated directly between
functional units without being written to the register file.

Given the requirement to make the architecture highly scalable, communication of all data
through a centralised register file is not a viable architectural option. Whenever a functional
unit generates a result it is held in an output register until explicitly overwritten by a
subsequent operation issued to the unit. During this time the functional unit to which the
result is connected may read it.

A single functional unit may have multiple output registers. Each of these is connected to
different functional units or functional unit operands. The output registers that are overwritten
by a new result from a functional unit are programmed as part of the execution word. This
allows the functional unit to be utilised even if the value from a particular output register has
yet to be used. It would be highly inefficient to leave an entire functional unit idle in order to
preserve the result latched on its output. In effect each functional unit has a small, dedicated
output register file associated with it to preserve its results.

An example functional unit array is given in Figure 3. The register file unit 301 is placed at the
centre with other functional units 302 placed as required by the application of the processor
architecture. Given the connectivity limitations of the functional unit array, not every unit is
connected to every other. Thus in some circumstances a data item may be generated by one
unit and need to be transported to another unit with which there is no direct connection. The
placement of the units and the connections between them is specifically designed to minimise
the number of occasions on which this occurs. The interconnection network is optimised for
the data flow that is characteristic of the required application code. The microarchitecture also
includes an instruction cache. It stores a subset of the code used to control the operation of
the functional units. A new execution word is fetched on each clock cycle and distributed
throughout the functional unit array in order to orchestrate the issuing of operations and the
steering of data between functional units.

To allow the transport of such data items, any functional unit may act as a repeater. That is, it
may select one of its operands and simply copy it to its output without any modification of the
data. Thus a particular value may be transmitted to any operand of a particular unit by using
functional units in repeater mode. A number of individual "hops" between functional units
may have to be made to reach a particular destination. Moreover, there may be several routes
to the same destination. The code generator selects the most appropriate route depending
upon other operations being performed in parallel.

There are underlying rules that govern how functional units can be connected together. Local
connections are primarily driven by the predominant data flows between the units. Higher
level rules ensure that all operands and results in the functional unit array are fully reachable.
That is, any result can reach any operand via a path through the array using units as repeaters.
These rules ensure that any code sequence involving the functional units can be generated.
The performance of the code generated will obviously depend on how well the data flows
match the general characteristics of the application. Code that represents a poor match will
require much more use of repeating through the array.
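
The reachability rule and the use of repeaters can be illustrated with a small model. The
following Python sketch is not part of the disclosure: it assumes connectivity is recorded per
functional unit (rather than per result register and operand), and the unit names and the
EXAMPLE array are hypothetical. A breadth-first search returns the route with the fewest hops;
a real code generator would also weigh the other operations scheduled in parallel.

```python
from collections import deque


def find_route(connections, source, target):
    """Return the shortest list of units from source to target, or None.

    connections maps a unit to the units that can read its output registers
    directly. Intermediate units on the returned path act as repeaters.
    """
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in connections.get(path[-1], ()):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None


def fully_reachable(connections):
    """Check that every unit can reach every other unit, possibly via repeaters."""
    units = set(connections) | {u for outs in connections.values() for u in outs}
    return all(
        find_route(connections, a, b) is not None
        for a in units
        for b in units
        if a != b
    )


# Hypothetical four-unit array (assumption for illustration only).
EXAMPLE = {
    "regfile": ["alu", "mul"],
    "alu": ["regfile", "mem"],
    "mul": ["alu"],
    "mem": ["regfile"],
}
```

In this hypothetical array the multiplier has no direct connection to the memory unit, so
find_route(EXAMPLE, "mul", "mem") passes the value through the ALU acting as a repeater.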

Region Based Execution 

In the preferred embodiment all execution is performed within blocks of code called regions.
This simplifies the implementation of both the instruction scheduling and the control
mechanisms in the hardware.

A region is a block of code that only has a single entry point but potentially many exit points.
The analysis performed by the code generation tools is able to form groups of basic blocks
into regions. Regions are often used as the basic arena in which global scheduling
optimisations are performed. Global scheduling refers to the movement of instructions across
branches as well as within individual basic blocks.

In the architecture, regions are always executed fully. If the region contains a number of
internal branches to basic blocks outside of the region then they are not resolved until the end
of the region is reached. The compiler constructs the regions from basic blocks so that they
contain the most likely execution paths through the basic blocks. A region is able to perform a
multi-way branch to select one of a number of different successor regions.

Figure 6 illustrates an example set of regions 601 and the relationships between them. It
shows the execution of the individual basic blocks 603, 604 and 605 within each region. The
regions themselves are composed of individual execution words 602. The set of control edges
606 from each region shows the possible successors for each region.

Execution Word Representation

The preferred embodiment uses a Very Large Instruction Word (VLIW) format. This allows
many parallel operations to be initiated on a single clock cycle, enabling significant parallelism.
The actual width is configurable. Shorter widths tend to be more efficient in terms of code
density but poorer in extracting parallelism from the application.

The instruction format is not fixed either and is dependent upon the execution units the user
defines for a particular processor. Unlike many contemporary VLIW architectures, the
architecture uses a flat decode structure. This means that a particular execution unit is always
controlled from a specific group of bits in the execution word. This makes the instruction
decoding for the architecture very straightforward. Other VLIWs tend to bundle a number of
independent operations into a single instruction word. They still require quite complex decode
logic to direct different operations to the appropriate execution units.

Figure 4 illustrates the basic instruction decode and control paths of the processor. The
instruction memory 404 holds the representation of the operations in the customized format
for the processor. A new execution word is fetched on each clock cycle. Each block of bits
405 in the execution word is used for controlling a particular execution unit 401.

The bits in the execution word are used to control multiplexers 406 that direct data from the
interconnection network to the operand inputs of the execution unit. Results from the
execution units are routed back to the interconnection network to be used by subsequent
operations.

A branch control unit 402 allows the architecture to execute new blocks of code by loading a
new value in the PC (Program Counter) 403. If a branch is not executed then the PC is just
incremented on each cycle to execute code sequentially from the instruction memory.

The code is stored in 32 bit wide words in main memory and transferred to a wider
instruction cache prior to actual execution. The instruction cache has a certain capacity to
allow particular code loops to remain cached without continuous access to main memory
being required. The wider instruction buffer can be configured in size to support power
consumption and area goals.

All bits within the execution word have positional context. There is a direct relationship
between particular bits and the functional units that they control. This greatly simplifies the
execution word decoding task. The appropriate bits for a particular functional unit are simply
routed from the execution word as required.
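
The flat decode structure can be sketched as a fixed table of bit positions. The following
Python fragment is an illustration only: the unit names, bit offsets and field widths in LAYOUT
are hypothetical, since the actual layout depends on the functional units configured for a
given processor.

```python
def extract_field(word, offset, width):
    """Return `width` bits of `word` starting at bit `offset` (bit 0 is the LSB)."""
    return (word >> offset) & ((1 << width) - 1)


# Hypothetical layout (assumption): unit name -> (bit offset, field width).
LAYOUT = {"alu": (0, 8), "mul": (8, 6), "regfile": (14, 10)}


def decode(word, layout=LAYOUT):
    """Route each unit's fixed group of bits out of the execution word."""
    return {
        unit: extract_field(word, offset, width)
        for unit, (offset, width) in layout.items()
    }
```

Because each unit always reads the same bit positions, decoding amounts to wiring; no logic is
needed to work out which unit an operation is intended for.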

The basic structure of the execution word is illustrated in Figure 8. The execution word 801 is
subdivided into a number of groups 802. Each group 803 controls one or more functional
units 804. The number of bits required to control a given functional unit is related to the
number of selectable sources for the unit's operands and the number of output registers for its
results. As these numbers are increased the number of bits required to uniquely specify them
grows.

If a group controls more than one functional unit then bits within the group may be shared. A
selection code is then used to indicate which particular functional unit is selected on each
clock cycle. This overlay mechanism allows direct trade-off between code density and
parallelism (i.e. performance) independently of the functional unit selections. A narrow
execution word forces more functional units to share groups. If more than one of those units
could be utilised on a particular cycle then performance may be lost as only one may be
selected. However, a narrow execution word increases the chances that all groups are usefully
employed on each clock cycle and thus code density is improved. If a wider execution word is
employed then greater parallelism is possible (as there is less sharing within each group) but
groups are more likely to go unused on any particular clock cycle.
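
The overlay mechanism can be sketched as follows. This Python fragment is an assumption-laden
illustration: it places the selection code in the bits above a shared payload field, and the
unit names are hypothetical. The disclosure does not specify where the selection code sits
within a group.

```python
def dispatch_group(group_bits, units, payload_width):
    """Return (selected unit, shared payload bits) for one group on one cycle.

    units lists the functional units sharing the group, indexed by selection
    code. The selection code is assumed to occupy the bits above the payload.
    """
    payload = group_bits & ((1 << payload_width) - 1)
    select = group_bits >> payload_width
    if select >= len(units):
        raise ValueError("selection code does not name a unit in this group")
    return units[select], payload
```

Only one unit in the group receives the payload on a given cycle, which is the source of the
lost parallelism described above when several sharing units could otherwise have been used.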
The whole of the execution word is used for controlling functional units apart from one bit.
This is the Region Flag and is used to indicate that the last execution word in a
region has been reached.
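
A minimal sketch of how the flag delimits a region is given below. It assumes, purely for
illustration, that the flag occupies bit 0 of the execution word and that words of a region are
stored at consecutive addresses; neither detail is stated in the disclosure.

```python
REGION_FLAG = 1 << 0  # hypothetical bit position of the flag (assumption)


def run_region(memory, pc):
    """Return the addresses executed, from pc up to and including the flagged word."""
    executed = []
    while True:
        executed.append(pc)
        if memory[pc] & REGION_FLAG:
            return executed
        pc += 1
```

The loop never leaves the region early, which reflects the rule that regions are always
executed fully before a successor region is chosen.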

Functional Units

The microarchitecture includes a configurable number of functional units. Each of those
functional units performs a particular operation upon a number of data operands to produce a
number of results. The functional units are pipelined and units with different latencies may be
freely mixed in the microarchitecture. The functional unit types and connectivity may be
configured as required. In the preferred embodiment this configuration is determined by an
automated analysis that finds the functional unit mix and connectivity that is best matched to
the requirements of the application code that the processor is to execute.

The internal architecture of a functional unit is shown in Figure 2. The central core of a
functional unit 203 is the execution unit itself 201. It performs the particular operation for the
unit. The surrounding blocks allow the functional unit to connect to other units and allow the
unit to be controlled from the execution word 205.

Functional units are placed within a virtual array arrangement. Individual functional units can
only communicate with near neighbours within this array. This spatial layout prevents the
architectural synthesis generating excessively long interconnects between units that would
significantly impact clock speed.

Fields within the execution word control the operand multiplexers 206. These are responsible
for selecting the correct operands 202 to present to the execution unit. In some circumstances
the operand may be fixed to a certain bus, removing the requirement for a multiplexer. The
number of selectable sources and the choice of particular source buses are completely
configurable. The control input 207 determines the type of operation to be performed.

All results from an execution unit are held in independent output registers 204. These drive
data on buses connected to other functional units. Data is passed from one functional unit to
another in this manner. The output register holds the same data until a new operation is
performed on the functional unit that explicitly overwrites the register.

The functional units represent the building blocks of the processor. The selection of
functional units represents the basic configurability of the architecture. Functional units may
be selected as required for a particular application domain. The connections between the
functional units form them into constituent components of a fully programmable processor.
Individual functional units may be replicated as required in order to exploit parallelism in the
software targeted at the processor.

Embedded within the functional unit is the execution unit. This is the block that actually
performs the required operations. The execution unit is surrounded by additional logic that
allows the execution unit to be controlled by software as part of a processor and to
communicate with other functional units. All inputs to the execution unit are selected from a
number of data buses. These buses communicate data between the individual units
within the processor. Outputs from the execution unit are latched and then driven over a data
bus for use by another functional unit. Each functional unit is also embedded with some
control logic to allow the unit to be controlled from the execution word of the processor. A
method operand is extracted that selects which particular operation the execution unit should
perform.

A detailed architecture overview of a functional unit is shown in Figure 10. This shows the
internal connectivity between the constituent blocks. Both data signals 1012 and control
signals 1013 are shown. The diagram shows an execution unit with two operand ports and
one output port. However, the number of both input and output ports is completely
configurable.

The constituent blocks are as follows:

• Execution Unit: The execution unit 1001 receives operand data values from the operands
1014. These operands are fed by operand selectors 1002. In general a particular operand will
have multiple potential data sources 1003. However, if only one data source is required then
an operand port may be directly connected to the external data bus, avoiding the need for an
operand selector. A method selection 1015 is obtained from the controller unit. This is
extracted directly from the execution word 1004 but is delayed by one clock cycle by the
controller so its presentation is in synchronization with the associated operands. The select
flag 1016 is asserted if the execution unit has been selected to perform a new operation during
the clock cycle. If the flag is false then the method and operand inputs are undefined. The
execution unit generates a number of results 1017.

• Controller: The controller 1007 reads the opcode portion of the execution word and
compares the code against the fixed selection code 1008 for the unit. If there is a match then
the unit is being selected. The predicate mask 1005 shows the status of various conditions.
The predicate condition to which the operation belongs is specified as part of the
execution word. An operation is only performed if that predicate is true. Finally, the wait flag
1006 is used to indicate a pipeline stall and is used to prevent further operations being issued
to the unit.

• Operand Selector(s): An operand selector 1002 is simply a multiplexer for selecting one
of a number of data inputs from data buses. A portion of bits from the execution word is
used to specify the bus to be selected. These bits are registered so that they are delayed by one
clock cycle. This causes the data steering to be performed a cycle later than the execution
word distribution, as is required.

• Delay Pipeline: The delay pipeline 1010 simply delays the output register mask by a number
of clock cycles. The delay period is one less than the latency of the execution unit. Thus if the
execution unit has a latency of one then no delay pipeline is required. This allows the correct
output registers to be updated when the results from an operation are available.

• Output Registers: The number of output registers 1009 is equal to the number of output
connections 1011 from the unit. As the execution unit generates results they are registered
in one or more of the output registers. An output register mask is included in the execution
word and specifies which registers should be latched for any given operation. New data from
the execution unit is only registered if it is producing a valid output during that cycle. Data
remains in the output register until explicitly overwritten by subsequent operations performed
on a functional unit.
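
The behaviour of the output registers under control of the output register mask can be
modelled in a few lines. This Python sketch is illustrative only; the class name, the use of
bit i of the mask for register i, and the zero reset value are assumptions.

```python
class OutputRegisters:
    """Model of a functional unit's output registers (illustrative sketch)."""

    def __init__(self, count):
        self.values = [0] * count  # reset value of 0 is an assumption

    def latch(self, mask, result, valid=True):
        """Overwrite the registers whose mask bit is set; the rest keep their data.

        Bit i of mask is assumed to control register i. Nothing is latched
        unless the execution unit is producing a valid output this cycle.
        """
        if not valid:
            return
        for i in range(len(self.values)):
            if (mask >> i) & 1:
                self.values[i] = result
```

A result latched into one register therefore survives later operations on the same unit for
as long as their masks leave that register's bit clear.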

Figure 9 shows greater detail of the control plane for a functional unit. The area 903 shows the
bits within the execution word that are used to control the functional unit. This is composed
of a number of different sections for describing different aspects of the operation to be
performed by the functional unit. This field is formed from a subset of bits from an overall
execution word used to control all of the functional units within the architecture.

The sub-field 904 controls which of the output registers should be updated with a result. The
sub-field 905 is used to control the operand multiplexers 911 to select the correct source of
data for an operation. This data is fed to the execution unit via the operand inputs 909. The
sub-field 906 provides the method that should be performed by the execution unit and is fed
to the unit via the input 910.

The optional sub-field 907 provides the number of a predicate flag which controls the
execution. This is used to select the corresponding status bit from a predicate status mask 912
via the multiplexer. The mask is a global state accessible to all functional units that indicates
dynamically which instructions should be completed. The selected bit is used to condition the
functional unit select so that if a particular instruction is disabled then the functional unit
operation is not performed. Certain functional units may be executed unconditionally and do
not require such a field.

The opcode bits 908 are used to select the particular functional unit. If the opcode does not
have the required value then all of the other bits are considered to be undefined and the
functional unit performs no operation.

Controller Unit

The controller glue unit is responsible for generating the unit select signal, the result selector
and the output register mask. The unit select signal is asserted if a new operation is being
initiated on the execution unit during the cycle. The output register mask is used to control the
registering of new data in the output registers of the functional unit.

There are two distinct types of controller unit depending upon whether execution is
conditional on a particular predicate. If an execution unit has no side effects then it can be
executed unconditionally as a speculative use of the unit will not have any permanent side
effects. Units that result in side effects (such as any unit with internal state such as a register
file or memory unit) must always be executed conditionally. Unconditional units can have a more
compact representation than conditional units as no predicate number needs to be specified as
part of the execution word.

Conditional Controller

Figure 11 shows the internal architecture and connectivity for a conditional controller glue
unit. The figure shows both data signals 1115 and control signals 1114. A conditional
controller 1101 is used when the execution unit has internal state so that certain methods
cannot be executed speculatively. A predicate selector field 1106 is used to select 1107 the
appropriate bit from a predicate status mask 1105. This is used to gate both the unit select
1109 and the output register mask 1110. An incoming opcode 1103 is compared against a
fixed selection code 1104 using the comparator 1108. The output of this is also used to gate
the unit select 1109 and output register mask 1110.

The register 1102 is used to delay the unit select 1109 so that it is valid during the execute
cycle of the execution unit. Another register is used to hold the conditioned form of the
output register mask 1113. This ensures that the output registers are not updated if a unit is
not selected during a cycle. Finally the method 1112 is simply delayed by one clock cycle to
generate the method 1111 that is in synchronization with the other signals.

Unconditional Controller

Figure 12 shows the internal architecture of an unconditional controller glue unit. Both data
signals 1210 and control signals 1211 are shown. An unconditional controller 1201 is used
when the execution unit has no internal state so all methods can be performed speculatively.
The opcode 1202 is compared against the fixed selection code 1203 using the comparator
1212. This generates a signal that conditions the unit select 1204 and output register mask
1205. The unit select is registered 1207 to generate a unit select during the execute cycle of the
execution unit. The output register mask 1208 is conditioned and registered to produce the
value 1205 in the correct clock cycle. The method field 1209 is simply registered to generate
the method 1206 in the correct execution cycle.

Operand Selector Unit

Figure 13 shows the basic structure of an operand selector unit. An operand selector 1301 is
primarily composed of a multiplexer 1303 that directs the contents of a particular data bus
1302 to the output operand 1304. The output operand is then fed to the input of an execution
unit. The switching of the multiplexer is performed during the first execution cycle of the
execution unit. The multiplexer is controlled directly by specific bits in the execution word
1306. These are distributed during the decode cycle and are held in the register 1305 until the
first execution cycle.

Delay Pipeline Unit

The delay pipeline simply delays the Output Register Mask bits so that they are available on
the clock cycle during which the results from the execution unit are generated. The delay
pipeline is only required if the latency of the execution unit is greater than one clock. The
delay required in the pipeline is one less than the latency of the execution unit.

The architecture of the delay pipeline is shown in Figure 14. The delay pipeline 1401 contains
a number of register stages 1404 that delay the input 1402 relative to the output 1403.
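
The relationship between unit latency and pipeline depth can be illustrated with a small shift-register model. This is a sketch under stated assumptions: the class name `DelayPipeline`, the `clock` method, and the zero-initialised stages are hypothetical choices, not details taken from the specification.

```python
from collections import deque


class DelayPipeline:
    """Delays a mask value by (latency - 1) clock cycles (hypothetical model)."""

    def __init__(self, latency):
        if latency < 1:
            raise ValueError("latency must be at least one clock")
        # One register stage per cycle of delay; stages start empty (0),
        # which is an assumption about the reset state.
        self.stages = deque([0] * (latency - 1))

    def clock(self, mask_in):
        """Advance one clock; return the mask emerging from the last stage."""
        if not self.stages:
            # Single-cycle units need no delay pipeline at all.
            return mask_in
        self.stages.append(mask_in)
        return self.stages.popleft()
```

For a unit with a latency of three clocks the model holds two stages, so a mask presented on one cycle emerges two cycles later, alongside the result it applies to.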

Output Registers Unit

The output registers unit holds results from a particular result port of an execution unit. The
architecture is shown in Figure 15. The output registers unit 1501 may drive a number of
connection buses to the operands of other functional units 1503. Each bus has an associated
register 1502. The output register mask (that is specified as part of the execution word) 1505
determines which particular registers are updated by an operation executed on the unit. The
data 1506 is obtained from the execution unit. An optional test chain 1504 allows the state of
the registers to be read and written by a debug system.
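
The effect of the output register mask can be sketched as follows. The function name `update_output_registers` and the convention that bit i of the mask enables register i are assumptions made for illustration; the specification does not define a bit ordering.

```python
def update_output_registers(registers, output_register_mask, result):
    """Return the new register contents after one result is produced.

    Bit i of the mask (assumed ordering) enables the update of register i;
    registers whose bit is clear keep their previous value.
    """
    return [result if (output_register_mask >> i) & 1 else old
            for i, old in enumerate(registers)]
```

For example, with three registers holding 1, 2 and 3, a mask of 0b101 and a result of 9, the first and third registers take the new value while the second is left untouched.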

Copying Operands

Certain functional units are able to perform a copy operation as a side effect of their normal
operation. Copying functionality is required to ensure that all the operands in the processor
are fully reachable from all the results. That is, it is possible to move any result to any operand
unit via a sequence of copy operations if no direct physical connection is available between the
units. The processor architecture is configured so that this is the case.

This copy mechanism has the advantage that it makes use of a side effect of the unit's
operation to perform a copy. Thus very little additional logic is required to support the
copying functionality.

Figure 1 provides an example copying mechanism. The mechanism is reliant on the fact
that operating with the value 0 is an identity operation for a large number of unit types. The
example shows the use of an adder unit 101 for providing a copy but the technique is equally
applicable to logical units and various other unit types. Addition of 0 to an operand copies the
input operand to the result.

If a copy is to be performed from the upper operand then the lower operand is set to have a
value of 0 as shown in 103. This is achieved by fixing the 0 selection for the operand selector
to be tied to the literal value 0. The upper operand is added to 0, thus producing a copy of the
input value on the output. Conversely, if a copy is to be made of the lower operand then the
upper operand is set to 0 as shown in 104. The input operands to the unit itself are shown as
102 and the copied results are held in the output registers 105. The operand multiplexers 106
are used to select the appropriate input data.

A special copy method is nominated in the definition of each functional unit that can be used
to perform copies. Such a method must be able to use 0 to perform an identity operation if
there are multiple input operands that may be copied.
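
The identity-operation copy can be shown with a short sketch. The names `IDENTITY_WITH_ZERO` and `copy_via_unit` are hypothetical, and the choice of add, OR and XOR as example methods is an assumption drawn from the statement that the technique applies to adders and logical units.

```python
# Operations for which 0 is an identity element on either operand.
# The selection of these three is illustrative, not exhaustive.
IDENTITY_WITH_ZERO = {
    "add": lambda upper, lower: upper + lower,
    "or": lambda upper, lower: upper | lower,
    "xor": lambda upper, lower: upper ^ lower,
}


def copy_via_unit(method, value, from_upper=True):
    """Copy `value` through a unit by tying the other operand to literal 0."""
    operation = IDENTITY_WITH_ZERO[method]
    if from_upper:
        # Lower operand selector is steered to the literal 0.
        return operation(value, 0)
    # Upper operand selector is steered to the literal 0.
    return operation(0, value)
```

Whichever operand carries the value, the result equals the input, so the unit's ordinary method doubles as a copy without dedicated copy logic.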

This section describes the cycle level timing of various activities in the processor pipeline. The
architecture is characterised by having a short and highly regular pipeline, together with the
flexibility to allow functional units with arbitrary length internal pipelines to be included in the
processor. Due to the partitioned nature of the control flow paths in comparison to the data
paths, they have separate pipelines.

The control path uses a very simple three stage pipeline reminiscent of early RISC
architectures. The three stages are fetch, decode and execute. During the fetch stage the next
execution word is read from the instruction cache. During the decode stage the execution
word is distributed to the functional units and the appropriate segments are decoded by the
units. Finally, during the execute cycle the operations are presented and initiated in the
appropriate functional units.
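
The overlap of the three control stages can be visualised with a small timeline generator. This is a simplified sketch: the function name `pipeline_timeline` is hypothetical, execution words are identified by index, and stalls and region successions are ignored.

```python
def pipeline_timeline(num_words):
    """Return, per clock cycle, which execution word occupies each stage.

    Assumes one word enters the fetch stage per cycle with no stalls.
    """
    stages = ("fetch", "decode", "execute")
    timeline = []
    for cycle in range(num_words + len(stages) - 1):
        row = {}
        for offset, stage in enumerate(stages):
            word = cycle - offset  # word that reached this stage this cycle
            if 0 <= word < num_words:
                row[stage] = word
        timeline.append(row)
    return timeline
```

For two execution words the timeline has four cycles: word 0 is fetched in the first, decoded in the second while word 1 is fetched, and executed in the third while word 1 is decoded.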

Each of the functional units has its own independent pipeline, controlled from a master
clock. The length of the execution pipeline for each unit is specified in the execution unit
model as its latency. The code generator automatically takes account of the length of
execution unit pipelines in the management of the data flow between functional units. The
independent specification of the pipeline length for each execution unit allows great flexibility
in the construction of the individual units. Each functional unit can generate a wait signal. If
this is asserted then the entire pipeline of the processor is stalled. This allows the
implementation of execution units that sometimes require an extended latency period. For
instance, it can be used for cache memory units where the latency is longer if a particular data
item is not present in the cache.

WO 2004/003777    PCT/GB2003/002774

The short pipeline allows the branch that occurs at the end of a region to occur without any
pipeline bubbles. The last execution cycle of a region can occur back-to-back with the first
execution cycle of the succeeding one.

Instruction Timing

Figure 16 provides a timeline of instruction execution in the architecture. During the fetch
cycle 1601 the EWA (Execution Word Address) 1606 is used to address the instruction cache
1607 and obtain an execution word. During the decode cycle 1602 the appropriate bits from
the word are distributed 1608 to all the functional units in the system. Each of these has
comparison logic embedded within the controller glue unit to determine if an operation for
that unit has been selected. If so then an operation on the functional unit is initiated. All
functional units have at least one execute cycle 1603, 1604 and 1605. The data buses 1612
distribute results from functional units to the inputs of other functional units. Each functional
unit has a defined latency. The pipelines of the functional units run independently and do not
affect the timing of instruction fetching and decoding. The example shows a single cycle
functional unit 1611, a multi-cycle functional unit 1610 and a memory unit 1609.

In a typical RISC pipeline all functional unit pipelines must be completed by a write back of
results into a centralised register file. Thus the individual functional unit pipelines are
intimately tied into the overall control pipeline of the processor, as appropriate feed forward
paths must be managed to feed data from register writes to subsequent register reads. Since
the processor uses software to manage access to register files (they are treated like any other
functional unit) the functional unit pipelines can be effectively separated from the overall fetch
and decode pipeline. The code scheduling manages the data bus resources and ensures results
are only read when they are available from the outputs of functional units.

Data Flow Timing

Figure 17 illustrates the data flow timing of functional units in the preferred embodiment. It
shows a particular dynamic data path through the functional units as results are passed from
one unit to the next. Each clock cycle boundary is shown as 1705. The initial result is
produced from a single cycle functional unit 1701. The result is calculated during cycle 1. It is
latched by the output register in the unit at the end of cycle 1 and then driven onto the output
bus during cycle 2. At the start of cycle 2 the result is steered into a two-cycle functional unit
1702 operand. It is operated upon during the remainder of that cycle and during cycle 3. At
the end of cycle 3 it is latched into the output register and the result driven during cycle 4. It is
then steered into another single cycle execution unit 1703. Finally it is held in the output
register for an extra cycle while other operands for a subsequent operation become available.
During cycle 6 the data item is written into a memory unit 1704.
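
The cycle numbers in this walk-through follow from the unit latencies, as the short calculation below illustrates. The function name `drive_cycles` is hypothetical, and the model assumes that a unit starting work in cycle s with latency L latches its result at the end of cycle s + L - 1 and drives it onto its bus during cycle s + L.

```python
def drive_cycles(start_cycle, latencies):
    """Return the cycle in which each unit in a chain drives its result.

    Each unit is assumed to start in the cycle the previous result is driven.
    """
    cycles = []
    cycle = start_cycle
    for latency in latencies:
        cycle += latency  # latched at end of (cycle - 1), driven in `cycle`
        cycles.append(cycle)
    return cycles


# Chain from the walk-through: single cycle, two-cycle, single cycle units.
chain = drive_cycles(1, [1, 2, 1])
# The final result is held for one extra cycle before being consumed.
memory_write_cycle = chain[-1] + 1
```

Under these assumptions the three units drive their results in cycles 2, 4 and 5, and the extra hold cycle places the memory write in cycle 6, which agrees with the description.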

Region Succession Mechanism

The control mechanism for performing a region succession is illustrated in Figure 18. Such a
succession occurs when the end of a region is reached.

The destination address is determined prior to the end of the region and put into register
1812. A sufficient number of clock cycles are left between the resolution of the last potential
branch in the region and the last execution word in the region (in which the ERF flag is set).
This will leave a new instruction address available that has been looked up from the
instruction cache.

The mechanism allows a flag to be set on the last instruction of a region (ERF) and to
immediately initiate a succession so that the first instruction from the new region can be
executed without any further latency.

The instruction 1804 is the last to be executed in a region. Thus it has the End Region Flag
(ERF) set 1805, which is used to control a multiplexer 1811 that selects the next execution
address from either the Execution Word Address (EWA) 1813 or the new address 1812. The
next execution address is applied to the instruction cache 1810. This selection can be
performed during the same cycle as the access itself, thus allowing very quick address steering.
The EWA is incremented 1807 by one on each cycle 1814 so that execution is advanced
through the region. Thus the first instruction 1808 of the new region is executed as EWA is
loaded with the new address plus one. The ERF for the first instruction of the new region is
reset 1806, causing the selection of the EWA pointing to the second instruction of the region.
Thus instruction 1809 is the second to be executed from the new region.

Each instruction consists of a fetch cycle 1801, a decode cycle 1802 and an execute cycle 1803.
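
The address steering performed at a region boundary can be sketched as a single step function. This is a simplified model: the function name `succession_step` is hypothetical, addresses are plain integers, and the overlap of the fetch, decode and execute stages is ignored.

```python
def succession_step(ewa, erf, new_region_address):
    """Model one cycle of address selection.

    Returns (address applied to the instruction cache, next EWA value).
    """
    # Multiplexer: the ERF steers the new region address to the cache,
    # otherwise the current EWA is used.
    fetch_address = new_region_address if erf else ewa
    # The EWA advances by one each cycle, so after a succession it holds
    # the new address plus one.
    return fetch_address, fetch_address + 1
```

While the ERF is clear the EWA simply advances through the region. When it is set, the new address is applied to the cache and the EWA is loaded with that address plus one, so the second word of the new region follows without a gap.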

Branch Control Unit

Branch operations may be issued to the branch unit. Branch operations only load the
required destination information into the branch registers within the branch control unit. The
actual branch is not performed until the end of the region is reached. Thus a multi-way branch
is resolved at the end of the region execution.

The branch control unit determines which region will be executed next. The unit is able to
handle multi-way branch conditions. A number of branch destinations with associated
conditions may be issued in a region. The branch control unit determines which branch will
be taken on the basis of which conditions evaluate to true and the relative priority of the
branches.
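
The selection rule can be sketched as follows. The `Branch` type, the function name `select_destination`, and the convention that a larger number means a higher priority are all assumptions made for illustration; the specification does not define a priority encoding.

```python
from typing import NamedTuple


class Branch(NamedTuple):
    destination: int   # address of the target region
    priority: int      # assumed: larger value means higher priority
    condition: bool    # evaluated predicate for this branch


def select_destination(branches, fall_through):
    """Pick the destination of the highest-priority branch whose condition
    is true; fall through to the following region if none is true."""
    taken = [b for b in branches if b.condition]
    if not taken:
        return fall_through
    return max(taken, key=lambda b: b.priority).destination
```

Because only the priority and the condition are consulted, the order in which the branches were issued within the region does not affect the outcome.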

Region Branch State (RBS)

The RBS is a register that holds the current state for a destination branch selected from a
region. The RBS has three possible states as listed below:

Default: This is the default state that indicates that there should be a fall through to the
following region when the execution of the current one is completed.

Restart: This indicates that the current region should be re-executed when the current
execution is completed. The region address is obtained from the Region Base Address
register.

Branch: Indicates that the branch control unit has selected a branch destination. Branches
are given a static priority which may differ from their issue order. The branch issued with the
highest priority and a true condition is selected.

The state transitions are detailed in Figure 19. The initial state is Default 1901. The other
possible states are Branch 1902 and Restart 1903. If a branch is selected 1904 then a transition
is made from the Default to the Branch state. A return to the Default state is made if the end
of the region is reached. Earlier branches may be selected 1905 while in Branch mode without
changing the state. In some circumstances a data hazard may require a re-execution of a
region. The transition 1909 is then made from Default state to Restart state to arrange for the
region to be re-executed to resolve the hazard. If in Branch state and a region execution needs
to be repeated due to a data hazard then a transition 1907 is made to the Restart state. Finally,
if in Restart state and a branch earlier than the cause of a restart performs a branch then the
transition is made to the Branch state.
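
The transitions described above can be collected into a small table-driven state machine. This is a sketch only: the names `RBS`, `Event` and `next_state` are hypothetical, and the two end-of-region entries marked as assumptions are not stated in the text.

```python
from enum import Enum


class RBS(Enum):
    DEFAULT = "default"
    BRANCH = "branch"
    RESTART = "restart"


class Event(Enum):
    BRANCH_SELECTED = "branch_selected"   # includes an earlier branch
    DATA_HAZARD = "data_hazard"
    END_OF_REGION = "end_of_region"


TRANSITIONS = {
    (RBS.DEFAULT, Event.BRANCH_SELECTED): RBS.BRANCH,
    (RBS.BRANCH, Event.BRANCH_SELECTED): RBS.BRANCH,   # state unchanged
    (RBS.DEFAULT, Event.DATA_HAZARD): RBS.RESTART,
    (RBS.BRANCH, Event.DATA_HAZARD): RBS.RESTART,
    # Only valid for a branch earlier than the cause of the restart.
    (RBS.RESTART, Event.BRANCH_SELECTED): RBS.BRANCH,
    (RBS.BRANCH, Event.END_OF_REGION): RBS.DEFAULT,
    # Assumptions: the text does not describe these two cases.
    (RBS.DEFAULT, Event.END_OF_REGION): RBS.DEFAULT,
    (RBS.RESTART, Event.END_OF_REGION): RBS.DEFAULT,
}


def next_state(state, event):
    """Return the next RBS state; unlisted pairs leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)
```

A table keeps each documented transition on its own line, which makes it straightforward to compare the model against the figure and to see which entries are assumed.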

Region Base Address (RBA)

The RBA register holds the address of the start of the region currently being executed. It is
loaded with the value of Next Region Address (NRA) at the start of each new region
execution. It is used to generate the next value of NRA if a region is to be restarted.

Next Region Address (NRA)

The NRA register contains the address of the next region that is to be executed. The branch
resolution unit calculates the NRA as each branch is issued. The highest priority branch is
selected as the destination. The branch does not occur until the end of the current region is
reached. The NRA consists of two fields. There is a full destination address that allows the
specification of an address in main memory.

The use of the NRA is illustrated in Figure 5. When the end of the region is reached the NRA
register 501 is used to find the correct entry in the instruction cache and to set the initial
predicate status bits. The address of the region within the instruction cache 502 is looked up
504 in the instruction cache and is then loaded into EWA, from where execution of the region
is commenced. The lowest predicate to be set 503 is converted into a mask 505 showing
which predicates are valid and then loaded into a predicate mask register.

Branch Resolution Architecture

This unit is responsible for selecting a destination address. The structure of the branch
resolution unit is shown in Figure 7. A branch destination address is supplied 707 by the
branch functional unit. A multiplexer 711 selects between that address and the RBA 703 on
the basis of the loop flag 706 supplied from the branch functional unit. This allows a branch
to be issued that causes a branch to the start of the region without having to specify a
destination address.

A Next Execution Address (NEA) 702 is used to hold the address of a following region to be
executed in the absence of a branch being issued in a region.

A branch priority and predicate condition 708 is supplied from the branch functional unit.
Multiplexer 711 selects the default state 709 if a loop is being performed. This is compared
against the previously highest priority from the current Next Region Address (NRA) 704. If a
branch priority is higher than a previously selected branch then it is used instead. A
demultiplexer 710 is used to determine if the branch predicate 705 is true. The block 701 is
responsible for maintaining the RBS state machine and selecting destinations as required. It
supplies a squash vector 706 to a predicate control unit.

At the end of a region execution the NRA 704 holds the destination address that is supplied
to the instruction cache via 707.

It is understood that there are many possible alternative embodiments of the invention. It is
recognized that the description contained herein is only one possible embodiment. This
should not be taken as a limitation of the scope of the invention. The scope should be defined
by the claims and we therefore assert as our invention all that comes within the scope and
spirit of those claims.

CLAIMS

1. A microprocessor with an architecture incorporating a number of execution units,
whereby:

a number of registers store results from particular execution units;

execution unit operands may receive data from a number of these registers; and

certain execution units are able to copy data from their operands to results.

2. The microprocessor according to claim 1 whereby one or more of the execution units
may be register files.

3. The microprocessor according to claim 1 whereby the set of registers associated with
a particular execution unit to be written may be specified for each operation.

4. The microprocessor according to claim 3 whereby the specification of registers to
write is represented in the instruction format.

5. The microprocessor according to claim 4 whereby the specification of registers to
write is delayed in a pipeline so as to be available on the same clock cycle as the
results.

6. The microprocessor according to claim 1 whereby the connectivity between execution
units is known to the code generation software tools.

7. The microprocessor according to claim 1 whereby the available execution units are
specified in a library file.

8. The microprocessor according to claim 1 whereby the connectivity of execution units
to other units in the system is configurable.

9. The microprocessor according to claim 8 whereby the number of output registers
associated with an execution unit is configurable.

10. The microprocessor according to claim 1 whereby the update of the output registers
is dependent on global condition state for certain execution units.

11. The microprocessor according to claim 10 whereby the state used to control the
output register update is selectable as part of the instruction set.

12. The microprocessor according to claim 1 whereby certain identity operations may be
issued to an execution unit in order to perform a copy.

13. The microprocessor according to claim 1 whereby the operation of certain bits within
an execution word control certain execution units on a cycle by cycle basis.

14. The microprocessor according to claim 13 whereby the number of bits required to
control each execution unit varies depending upon the extent of its connectivity.

15. The microprocessor according to claim 13 whereby certain bits within the execution
word for each execution unit select different types of operation to be performed.

16. The microprocessor according to claim 1 whereby each output register may be
connected to one or more execution unit operands.

17. The microprocessor according to claim 1 whereby the source register for a particular
execution unit operand may be specified by the instruction set.

18. The microprocessor according to claim 1 whereby the processor executes a sequence
of contiguous execution words.

19. The microprocessor according to claim 18 whereby, when the end of the execution
word sequence is reached, execution may branch to one of a number of different
execution word addresses.

20. The microprocessor according to claim 19 whereby the same execution word
sequence may be repeated to resolve a data hazard.

21. The microprocessor according to claim 20 whereby there is a branch control unit for
determining the destination of such branches.

22. The microprocessor according to claim 21 whereby the branch control unit may
accept branches out of their sequential order.

23. The microprocessor according to claim 22 whereby the branch control unit may
disable the operation of certain subsequent operations depending on the sequential
position of an accepted branch.

24. A method of operation used in a microprocessor with an architecture incorporating a
number of execution units, whereby:

a number of registers store results from particular execution units;

execution unit operands may receive data from a number of these registers; and

certain execution units are able to copy data from their operands to results.

25. The method of claim 24 as used in a microprocessor as defined in any preceding
claim 2-23.